Abstract

We propose methodology for estimation of sparse precision matrices and statistical inference for their low-dimensional parameters in a high-dimensional setting where the number of parameters $p$ can be much larger than the sample size. We show that the novel estimator achieves minimax rates in supremum norm and the low-dimensional components of the estimator have a Gaussian limiting distribution. These results hold uniformly over the class of precision matrices with row sparsity of small order $\sqrt{n}/\log p$ and spectrum uniformly bounded, under a sub-Gaussian tail assumption on the margins of the true underlying distribution. Consequently, our results lead to uniformly valid confidence regions for low-dimensional parameters of the precision matrix. Thresholding the estimator leads to variable selection without imposing irrepresentability conditions. The performance of the method is demonstrated in a simulation study and on real data.

Keywords: precision matrix sparsity inference asymptotic normality confidence 
regions 

Subject classification: 62J07 62F12 


1 Introduction 

We consider the problem of estimation of the inverse covariance matrix in a high-dimensional setting, where the number of parameters $p$ can significantly exceed the sample size $n$. Suppose that we are given an $n \times p$ design matrix $X$, where the rows of $X$ are $p$-dimensional i.i.d. random vectors from an unknown distribution with mean zero and covariance matrix $\Sigma_0 \in \mathbb{R}^{p \times p}$. We denote the precision matrix by $\Theta_0 := \Sigma_0^{-1}$, assuming the inverse of $\Sigma_0$ exists.

The problem of estimating the precision matrix arises in a wide range of applications. Precision matrix estimation in particular plays an important role in graphical models, which have become a popular tool for representing dependencies within large sets of variables. Suppose that we associate the variables $X_1, \dots, X_p$ with the vertex set $V = \{1, \dots, p\}$ of an undirected graph $\mathcal{G} = (V, \mathcal{E})$ with an edge set $\mathcal{E}$. A graphical model $\mathcal{G}$ represents the conditional dependence relationships between the variables, namely every pair of variables not contained in the edge set is conditionally independent given all remaining variables. If the vector $(X_1, \dots, X_p)$ is normally distributed, each edge corresponds to a non-zero entry in the precision matrix (Lauritzen (1996)). Practical examples of applications of graphical modeling include modeling of brain connectivity based on fMRI brain analysis (Ng et al., 2013), genetic networks, financial data processing, social network analysis and climate data analysis.

A lot of work has been done on methodology for point estimation of precision matrices. We discuss some of the approaches below, but a selected list of papers includes for instance Meinshausen and Bühlmann (2006); Friedman et al. (2008); Bickel and Levina (2008); Yuan (2010); Cai et al. (2011); Sun and Zhang (2012). A common approach assumes that the precision matrix is sufficiently sparse and employs the $\ell_1$-penalty to induce a sparse estimator. The main goal of these works is to show that, under some regularity conditions, the sparse estimator behaves almost as the oracle estimator that has the knowledge of the true sparsity pattern.

Our primary interest in this paper lies not in point estimation; instead, we aim to quantify the uncertainty of estimation by providing interval estimates for the entries of the precision matrix. The challenge of this problem arises since the asymptotics of regularized estimators, which are the main tool in high-dimensional estimation, are not easily tractable (Knight and Fu, 2000), as opposed to the classical setting where the dimension of the unknown parameter is fixed.


1.1 Overview of related work 

Methodology for inference in high-dimensional models has been mostly studied in the context of linear and generalized linear regression models. From the work on linear regression models, we mention the paper by Zhang and Zhang (2014), where a semi-parametric projection approach using the Lasso methodology (Tibshirani, 1996) was proposed, which was further developed and studied in van de Geer et al. (2013). The approach leads to asymptotically normal estimation of the regression coefficients, and an extension of the method to generalized linear models is given in van de Geer et al. (2013). The method requires sparsity of small order $\sqrt{n}/\log p$ in the high-dimensional parameter vector and uses the $\ell_1$-norm error bound of the Lasso. Further alternative methods for inference in the linear model have been proposed and studied in Javanmard and Montanari (2014) and Belloni et al. (2014), and a bootstrapping approach was suggested in Chatterjee and Lahiri (2013), Chatterjee and Lahiri (2011).


Other lines of work on inference for high-dimensional models suggest post-model selection procedures, where in the first step a regularized estimator is used for model selection and in the second step e.g. a maximum likelihood estimator is applied on the selected model. In the linear model, simple post-model selection methods have been proposed e.g. in Javanmard and Montanari (2013), Candes and Tao (2007). These approaches are however only guaranteed to work under irrepresentability and beta-min conditions (see Bühlmann and van de Geer (2011)). Especially in view of inference, beta-min conditions, which assume that the non-zero parameters are sufficiently large in absolute value, should be avoided.

In this paper, we consider estimation of precision matrices, which is a problem related to linear regression; however, it is a non-linear problem and thus it requires a more involved treatment. One approach to precision matrix estimation is based on regularization of the maximum likelihood in terms of the $\ell_1$-penalty. This approach is typically referred to as the graphical Lasso, and has been studied in detail in several papers, see e.g. Friedman et al. (2008), Rothman et al. (2008), Ravikumar et al. (2008) and Yuan and Lin (2007). Another common approach to precision matrix estimation is based on projections. This approach reduces the problem to a series of regression problems and estimates each column of the precision matrix using a Lasso estimator or Dantzig selector (Candes and Tao, 2007). The idea was first introduced in Meinshausen and Bühlmann (2006) as neighbourhood selection for Gaussian graphical models and further studied in Yuan (2010), Cai et al. (2011) and Sun and Zhang (2012).

Methodology leading to statistical inference for the precision matrix has been studied only recently. The work Ren et al. (2015) proposes to use a more involved variation of the regression approach to obtain an estimator which leads to statistical inference. This approach leads to an estimator of the precision matrix which is elementwise asymptotically normal, under row sparsity of order $\sqrt{n}/\log p$, bounded spectrum of the true precision matrix and Gaussian distribution of the sample. The paper Jankova and van de Geer (2015) proposes a method for statistical inference based on the graphical Lasso. The work introduces a de-sparsified estimator based on the graphical Lasso, which is also shown to be elementwise asymptotically normal.


1.2 Contributions and outline 

We propose methodology leading to honest confidence intervals and testing for low-dimensional parameters of the precision matrix, without requiring irrepresentability conditions or beta-min conditions to hold. Our work is motivated by the semi-parametric approach in van de Geer et al. (2013) and is a follow-up of the work Jankova and van de Geer (2015). Compared to the previous work on statistical inference for precision matrices, this methodology has several advantages. Firstly, the estimator we propose is a simple modification of the nodewise Lasso estimator proposed in Meinshausen and Bühlmann (2006). Hence the estimator is easy to implement and efficient solutions are available on the computational side. Secondly, the novel estimator enjoys a range of optimality properties and leads to statistical inference under mild conditions. In particular, the asymptotic distribution of low-dimensional components of the estimator is shown to be Gaussian. This holds uniformly over the class of precision matrices with row sparsity of order $o(\sqrt{n}/\log p)$, spectrum uniformly bounded in $n$ and sub-Gaussian margins of the underlying distribution. This results in honest confidence regions (Li, 1989) for low-dimensional parameters. The proposed estimator achieves rate optimality as shown in Section 3.2. Moreover, the de-sparsified estimator may be thresholded to guarantee variable selection without imposing irrepresentable conditions. The computational cost of the method is of order $O(p)$ Lasso regressions for estimation of all parameters and two Lasso regressions for a single parameter.

The paper is organized as follows. Section 2 introduces the methodology. Section 3 contains the main theoretical results for estimation and inference and in Section 3.4 the suggested method is applied to variable selection. Section 4 provides a comparison with related work. Section 5 illustrates the theoretical results in a simulation study. In Section 6 we analyze two real datasets and apply our method to variable selection. Section 7 contains a brief summary of the results. Finally, the proofs are deferred to the Supplementary material.

Notation. For a vector $x = (x_1, \dots, x_d) \in \mathbb{R}^d$ and $p \in (0, \infty]$ we use $\|x\|_p$ to denote the $p$-norm of $x$ in the classical sense. We denote $\|x\|_0 = |\{i : x_i \neq 0\}|$. For a matrix $A \in \mathbb{R}^{d \times d}$ we use the notations $|||A|||_\infty = \max_i \|e_i^T A\|_1$, $|||A|||_1 = |||A^T|||_\infty$ and $\|A\|_\infty = \max_{i,j} |A_{ij}|$. The symbol $\mathrm{vec}(A)$ denotes the vectorized version of a matrix $A$ obtained by stacking the rows of $A$ on each other. By $e_i$ we denote a $p$-dimensional vector of zeros with one at position $i$. For real sequences $f_n, g_n$, we write $f_n = O(g_n)$ if $|f_n| \le C|g_n|$ for some $C > 0$ independent of $n$ and all $n > C$. We write $f_n \asymp g_n$ if both $f_n = O(g_n)$ and $1/f_n = O(1/g_n)$ hold. Finally, $f_n = o(g_n)$ if $\lim_{n \to \infty} f_n/g_n = 0$. Furthermore, for a sequence of random variables $X_n$ we write $X_n = O_P(1)$ if $X_n$ is bounded in probability and we write $X_n = O_P(r_n)$ if $X_n/r_n = O_P(1)$. We write $X_n = o_P(1)$ if $X_n$ converges in probability to zero.

Let $\rightsquigarrow$ denote the convergence in distribution and $\xrightarrow{P}$ the convergence in probability. Let $\Phi$ denote the cumulative distribution function of a standard normal random variable. By $\Lambda_{\min}(A)$ and $\Lambda_{\max}(A)$ we denote the minimum and maximum eigenvalue of $A$, respectively. Let $a \vee b$, $a \wedge b$ denote $\max(a, b)$, $\min(a, b)$, respectively. We use letters $C, c$ to denote universal constants. These are used in the proofs repeatedly to denote possibly different constants.


2 De-sparsified nodewise Lasso 


Our methodology is a simple modification of the nodewise Lasso estimator proposed in Meinshausen and Bühlmann (2006). The idea is to remove the bias term which arises in the nodewise Lasso estimator due to $\ell_1$-penalty regularization. This approach is inspired by the literature on semiparametric statistics, Bickel et al. (1993); van der Vaart (2000). We note that several papers have used this idea in the context of high-dimensional sparse estimation, see Zhang and Zhang (2014); van de Geer et al. (2013); van de Geer (2016); Javanmard and Montanari (2014); Jankova and van de Geer (2015).


We first summarize the nodewise Lasso method introduced in Meinshausen and Bühlmann (2006) and discuss some of its properties. This method estimates an unknown precision matrix using the idea of projections to approximately invert the sample covariance matrix. For each $j = 1, \dots, p$ we define the vector $\gamma_j = \{\gamma_{j,k} : k \neq j\}$ as follows

$$\gamma_j := \arg\min_{\gamma \in \mathbb{R}^{p-1}} \mathbb{E}\|X_j - X_{-j}\gamma\|_2^2/n \qquad (1)$$

and denote $\eta_j := X_j - X_{-j}\gamma_j$ and the noise level by $\tau_j^2 = \mathbb{E}\eta_j^T\eta_j/n$. We define the column vector $\Gamma_j := (-\gamma_{j,1}, \dots, -\gamma_{j,j-1}, 1, -\gamma_{j,j+1}, \dots, -\gamma_{j,p})^T$. Then one may show

$$\Theta_0 = (\Theta_1^0, \dots, \Theta_p^0) = (\Gamma_1/\tau_1^2, \dots, \Gamma_p/\tau_p^2), \qquad (2)$$

where $\Theta_j^0$ is the $j$-th column of $\Theta_0$. Hence the precision matrix $\Theta_0$ may be recovered from the partial correlations and from the noise level $\tau_j^2$. In our problem, we are only given the design matrix $X$. The idea of nodewise Lasso is to estimate the partial correlations and the noise levels by doing a projection of every column of the design matrix on all the remaining columns. In low-dimensional settings, this procedure would simply recover the inverse of the sample covariance matrix $X^TX/n$. However, due to the high-dimensionality of our setting, the matrix $X^TX/n$ is not invertible and we can only do approximate projections. If we assume sparsity in the precision matrix (and thus also in the partial correlations), this idea can be effectively carried out using the Lasso. Hence, for each $j = 1, \dots, p$ define the estimators of the regression coefficients, $\hat\gamma_j = \{\hat\gamma_{j,k} : k = 1, \dots, p,\ k \neq j\} \in \mathbb{R}^{p-1}$, as follows

$$\hat\gamma_j := \arg\min_{\gamma \in \mathbb{R}^{p-1}} \|X_j - X_{-j}\gamma\|_2^2/n + 2\lambda_j\|\gamma\|_1. \qquad (3)$$

We further define the column vectors

$$\hat\Gamma_j := (-\hat\gamma_{j,1}, \dots, -\hat\gamma_{j,j-1}, 1, -\hat\gamma_{j,j+1}, \dots, -\hat\gamma_{j,p})^T$$

and estimators of the noise level

$$\hat\tau_j^2 := \|X_j - X_{-j}\hat\gamma_j\|_2^2/n + \lambda_j\|\hat\gamma_j\|_1,$$

for $j = 1, \dots, p$. Finally, we define the $j$-th column of the nodewise Lasso estimator $\hat\Theta$ as

$$\hat\Theta_j := \hat\Gamma_j/\hat\tau_j^2. \qquad (4)$$

The estimator $\hat\Theta$ of the precision matrix was studied in several papers (following Meinshausen and Bühlmann (2006)) and has been shown to enjoy oracle properties under mild conditions on the model. These conditions include bounded spectrum of the precision matrix, row sparsity of small order $n/\log p$ and a sub-Gaussian distribution of the rows of $X$ (alternatively to sub-Gaussianity, one may assume that the covariates are bounded as in van de Geer et al. (2013)). Our approach uses the nodewise Lasso estimator as an initial estimator. The next step involves de-biasing or de-sparsifying, which may be viewed as one step using the Newton-Raphson scheme for numerical optimization. This is equivalent to "inverting" the Karush-Kuhn-Tucker (KKT) conditions by the inverse of the Fisher information as in van de Geer et al. (2013). The challenge then also comes from the need to estimate the Fisher information matrix, which is a $p^2 \times p^2$ matrix. We show that the estimator $\hat\Theta$ can be used in a certain way to create a surrogate of the inverse Fisher information matrix. Since the estimator $\hat\Theta_j$ can be characterized by its KKT conditions, it is convenient to work with these conditions to derive the new de-sparsified estimator. Consider hence the KKT conditions for the optimization problem (3),

$$-X_{-j}^T(X_j - X_{-j}\hat\gamma_j)/n + \lambda_j\hat\kappa_j = 0, \qquad (5)$$

for $j = 1, \dots, p$, where $\hat\kappa_j$ is the sub-differential of the function $\gamma_j \mapsto \|\gamma_j\|_1$ at $\hat\gamma_j$, i.e.

$$\hat\kappa_{j,k} = \begin{cases} \mathrm{sign}(\hat\gamma_{j,k}) & \text{if } \hat\gamma_{j,k} \neq 0, \\ a_{j,k} \in [-1, 1] & \text{otherwise,} \end{cases}$$

where $k \in \{1, \dots, p\} \setminus \{j\}$. If we define $\hat Z_j$ to be a $p \times 1$ vector

$$\hat Z_j := (\hat\kappa_{j,1}, \dots, \hat\kappa_{j,j-1}, 0, \hat\kappa_{j,j+1}, \dots, \hat\kappa_{j,p})^T/\hat\tau_j^2,$$

then the KKT conditions may be equivalently stated as follows

$$\hat\Sigma\hat\Theta_j - e_j - \lambda_j\hat Z_j = 0, \quad \text{for } j = 1, \dots, p, \qquad (6)$$

where $\hat\Sigma = X^TX/n$ is the sample covariance matrix. This is shown in Lemma 11 in the Supplementary material. Consequently, the KKT conditions (6) imply a bound $\|\hat\Sigma\hat\Theta_j - e_j\|_\infty \le \lambda_j/\hat\tau_j^2$ for each $j = 1, \dots, p$, which will be useful later. Note that the KKT conditions may be equivalently summarized in a matrix form as $\hat\Sigma\hat\Theta - I - \hat Z\Lambda = 0$, where the columns of $\hat Z$ are given by $\hat Z_j$ for $j = 1, \dots, p$ and $\Lambda$ is a diagonal matrix with elements $(\lambda_1, \dots, \lambda_p)$.

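Multiplying the KKT conditions (6) by $\hat\Theta_i^T$, we obtain

$$\hat\Theta_i^T(\hat\Sigma\hat\Theta_j - e_j) - \hat\Theta_i^T\lambda_j\hat Z_j = 0.$$

Then we note that adding $\hat\Theta_{ij} - \Theta^0_{ij}$ to both sides and rearranging we get

$$\hat\Theta_{ij} - \hat\Theta_i^T\lambda_j\hat Z_j - \Theta^0_{ij} = \hat\Theta_{ij} - \hat\Theta_i^T(\hat\Sigma\hat\Theta_j - e_j) - \Theta^0_{ij} \qquad (7)$$
$$= -(\Theta_i^0)^T(\hat\Sigma - \Sigma_0)\Theta_j^0 + \Delta_{ij},$$

where $\Delta_{ij} = -(\hat\Theta_i - \Theta_i^0)^T(\hat\Sigma\hat\Theta_j - e_j) - (\hat\Theta_j - \Theta_j^0)^T(\hat\Sigma\Theta_i^0 - e_i)$ is a term which can be shown to be $o_P(n^{-1/2})$ under certain conditions (Lemma 1). Hence we define the de-sparsified nodewise Lasso estimator

$$\hat T := \hat\Theta - \hat\Theta^T(\hat\Sigma\hat\Theta - I) = \hat\Theta + \hat\Theta^T - \hat\Theta^T\hat\Sigma\hat\Theta. \qquad (8)$$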
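
The construction in (3), (4) and (8) can be sketched in a few lines of plain Python. This is only an illustration under assumptions that are ours, not the paper's: the function names `soft`, `nodewise_lasso` and `desparsify` are hypothetical, a single tuning parameter `lam` is used for every column, and the Lasso problems are solved by coordinate descent on the sample covariance matrix, which is one of several possible solvers.

```python
def soft(z, lam):
    """Soft-thresholding operator: sign(z) * max(|z| - lam, 0)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def nodewise_lasso(S, lam, sweeps=200):
    """Nodewise Lasso computed from the sample covariance S (list of lists).

    Column j of the result is Gamma_j / tau_j^2, cf. (3)-(4). Coordinate
    descent is an illustrative choice of solver (assumption).
    """
    p = len(S)
    Theta = [[0.0] * p for _ in range(p)]
    for j in range(p):
        g = [0.0] * p  # g[k] holds gamma_{j,k}; g[j] is kept at zero
        for _ in range(sweeps):
            for k in range(p):
                if k == j:
                    continue
                # X_k^T (X_j - sum_{l != k} X_l g_l) / n, written through S
                rho = S[k][j] - sum(S[k][l] * g[l] for l in range(p) if l != k)
                g[k] = soft(rho, lam) / S[k][k]
        # ||X_j - X_{-j} g||_2^2 / n, written through S
        rss = (S[j][j] - 2.0 * sum(g[k] * S[k][j] for k in range(p))
               + sum(g[k] * g[l] * S[k][l] for k in range(p) for l in range(p)))
        tau2 = rss + lam * sum(abs(v) for v in g)
        for k in range(p):
            Theta[k][j] = (1.0 if k == j else -g[k]) / tau2
    return Theta


def desparsify(Theta, S):
    """De-sparsified estimator T = Theta + Theta^T - Theta^T S Theta, cf. (8)."""
    p = len(S)
    S_Theta = [[sum(S[a][b] * Theta[b][j] for b in range(p)) for j in range(p)]
               for a in range(p)]
    return [[Theta[i][j] + Theta[j][i]
             - sum(Theta[a][i] * S_Theta[a][j] for a in range(p))
             for j in range(p)] for i in range(p)]


# Toy sample covariance with p = 2 (hypothetical numbers).
S = [[1.0, 0.5], [0.5, 1.0]]
Theta_hat = nodewise_lasso(S, lam=0.1)
T_hat = desparsify(Theta_hat, S)
```

In this toy case the exact inverse of `S` has diagonal entries 4/3 and off-diagonal entries -2/3; the nodewise Lasso entries are shrunk towards zero by the penalty, and the de-sparsified entries move back towards the inverse, which is the bias removal described above.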


3 Theoretical results 


In this part, we inspect the asymptotic behaviour of the de-sparsified nodewise Lasso estimator $\hat T$. In particular we consider the limiting distribution of individual entries of $\hat T$ and show that the convergence to the Gaussian distribution is uniform over the considered model. For construction of confidence intervals, we consider estimators of the asymptotic variance of the proposed estimator, both for Gaussian and sub-Gaussian design. We derive convergence rates of the method in supremum norm and consider application to variable selection.

For completeness, a lemma in the Supplementary material summarizes convergence rates of the nodewise Lasso estimator. The result is essentially the same as Theorem 2.4 in van de Geer et al. (2013) so the proof of the common parts is omitted. Recall that $V = \{1, \dots, p\}$ and we define the row sparsity by $s_j := \|\Theta_j^0\|_0$, maximum row sparsity by $s := \max_{1 \le j \le p} s_j$ and the coordinates of non-zero entries of the precision matrix by $S_0 := \{(i, j) \in V \times V : \Theta^0_{ij} \neq 0\}$. For the analysis below, we will need the following conditions.

A1 (Bounded spectrum) The inverse covariance matrix $\Theta_0 := \Sigma_0^{-1}$ exists and there exists a universal constant $L \ge 1$ such that

$$1/L \le \Lambda_{\min}(\Theta_0) \le \Lambda_{\max}(\Theta_0) \le L.$$

A2 (Sparsity) $s \log p/n = o(1)$.

A3 (Sub-Gaussianity condition) Suppose that the design matrix $X$ has uniformly sub-Gaussian rows $X_i$, i.e. there exists a universal constant $K > 0$ such that

$$\sup_{\alpha \in \mathbb{R}^p : \|\alpha\|_2 \le 1} \mathbb{E}\exp\left(|\alpha^T X_i|^2/K^2\right) \le 2 \quad (i = 1, \dots, n).$$


The lower bound in A1 guarantees that the noise level $\tau_j^2 = 1/\Theta^0_{jj}$ does not diverge. The upper bound (equivalently, a lower bound on the eigenvalues of $\Sigma_0$) guarantees that the compatibility condition (see Bühlmann and van de Geer (2011)) is satisfied for the matrix $\Sigma^0_{-j,-j}$, which is the true covariance matrix $\Sigma_0$ without the $j$-th row and $j$-th column. The sub-Gaussianity condition A3 is used to obtain concentration results which are crucial to our analysis. Condition A3 is also used to ensure that the compatibility condition is satisfied for $\hat\Sigma$ with high probability (see Bühlmann and van de Geer (2011)). Conditions A1, A2 and A3 are the same conditions as used in van de Geer et al. (2013) to obtain rates of convergence for the nodewise regression estimator. Define the parameter set

$$\mathcal{G}(s) := \{\Theta \in \mathbb{R}^{p \times p} : \max_{1 \le i \le p} \|\Theta_i\|_0 \le s,\ \text{A1 is satisfied}\}.$$


The following lemma shows that the proposed estimator $\hat T$ can be decomposed into a pivot term and a term which is of small order $1/\sqrt{n}$ with high probability.

Lemma 1. Suppose that $\hat\Theta$ is the nodewise Lasso estimator with regularization parameters $\lambda_j \ge c\sqrt{\log p/n}$ uniformly in $j$, for some sufficiently large constant $c > 0$. Suppose that A2 and A3 are satisfied. Then for each $(i, j) \in V \times V$ it holds

$$\hat T_{ij} - \Theta^0_{ij} = -(\Theta_i^0)^T(\hat\Sigma - \Sigma_0)\Theta_j^0 + \Delta_{ij}, \qquad (9)$$

where there exists a constant $C > 0$ such that

$$\lim_{n \to \infty} \sup_{\Theta_0 \in \mathcal{G}(s)} P\left(\max_{i,j = 1, \dots, p} |\Delta_{ij}| \ge C\,\frac{s \log p}{n}\right) = 0.$$

From Lemma 1 it follows that we need to assume a stronger sparsity condition than A2 for the remainder term $\Delta_{ij}$ to be negligible after normalization by $\sqrt{n}$. This is in accordance with other literature on the topic, see van de Geer et al. (2013), Ren et al. (2015). Hence we introduce the following strengthened sparsity condition.

A2* $s \log p/\sqrt{n} = o(1)$.

The next result shows that the elements of $\hat T$ are indeed asymptotically normal. To this end, we further define the asymptotic variance

$$\sigma_{ij}^2 := \mathrm{var}\left((\Theta_i^0)^T X_1 X_1^T \Theta_j^0\right).$$

In some of the results to follow, we shall assume a universal lower bound on $\sigma_{ij}$ as follows.

A4 There exists a universal constant $\omega > 0$ such that $\sigma_{ij} \ge \omega$.

Assumption A4 is satisfied e.g. under Gaussian design and A1. Denote a parameter set

$$\tilde{\mathcal{G}}(s) := \{\Theta \in \mathbb{R}^{p \times p} : \max_{1 \le i \le p} \|\Theta_i\|_0 \le s,\ \text{A1 and A4 are satisfied}\}.$$


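Theorem 1 (Asymptotic normality). Suppose that $\hat\Theta$ is the nodewise Lasso estimator with regularization parameters $\lambda_j \asymp \sqrt{\log p/n}$ uniformly in $j$. Suppose that A2* and A3 are satisfied. Then for every $(i, j) \in V \times V$ and $z \in \mathbb{R}$ it holds

$$\lim_{n \to \infty} \sup_{\Theta_0 \in \tilde{\mathcal{G}}(s)} \left|P_{\Theta_0}\left(\sqrt{n}(\hat T_{ij} - \Theta^0_{ij})/\sigma_{ij} \le z\right) - \Phi(z)\right| = 0.$$

To construct confidence intervals, a consistent estimator of the asymptotic variance $\sigma_{ij}^2$ is required. Consistent estimators of $\sigma_{ij}^2$ are discussed in Section 3.1 (see Lemmas 2 and 3). Hence Theorem 1 implies uniformly valid asymptotic confidence intervals $I_\alpha := [\hat T_{ij} \pm \Phi^{-1}(1 - \alpha/2)\hat\sigma_{ij}/\sqrt{n}]$, i.e.

$$\lim_{n \to \infty} \sup_{\Theta_0 \in \tilde{\mathcal{G}}(s)} \left|P_{\Theta_0}(\Theta^0_{ij} \in I_\alpha) - (1 - \alpha)\right| = 0.$$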

The result also enables testing hypotheses about individual elements of the precision matrix. For testing multiple hypotheses simultaneously, we may use standard procedures such as the Bonferroni-Holm procedure (see van de Geer et al. (2013)).
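
As a small illustration of how Theorem 1 is used in practice, the sketch below computes the interval $\hat T_{ij} \pm \Phi^{-1}(1 - \alpha/2)\hat\sigma_{ij}/\sqrt{n}$ and a two-sided p-value for a single entry. The function names and the numerical inputs are hypothetical; the standard normal quantile is taken from Python's `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist


def confidence_interval(t_ij, sigma_ij, n, alpha=0.05):
    """Asymptotic (1 - alpha) interval T_ij +/- Phi^{-1}(1 - alpha/2) * sigma_ij / sqrt(n)."""
    half = NormalDist().inv_cdf(1.0 - alpha / 2.0) * sigma_ij / sqrt(n)
    return t_ij - half, t_ij + half


def z_test_pvalue(t_ij, sigma_ij, n, theta_null=0.0):
    """Two-sided p-value for H0: Theta0_ij = theta_null, based on the normal limit."""
    z = sqrt(n) * (t_ij - theta_null) / sigma_ij
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Hypothetical values of the de-sparsified entry and its estimated standard deviation.
lower, upper = confidence_interval(t_ij=0.30, sigma_ij=1.1, n=400)
p_value = z_test_pvalue(t_ij=0.30, sigma_ij=1.1, n=400)
```

For several entries tested at once, the resulting p-values would be passed to a multiplicity correction such as Bonferroni-Holm, as mentioned above.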


3.1 Variance estimation 

For the case of Gaussian observations, we may easily calculate the theoretical variance and plug in the estimate $\hat\Theta$ in place of the unknown $\Theta_0$, as is displayed in Lemma 2 below.

Lemma 2. Suppose that assumptions A2 and A3 are satisfied and assume that the rows of the design matrix $X$ are independent $\mathcal{N}(0, \Sigma_0)$-distributed. Let $\hat\Theta$ be the nodewise Lasso estimator and let $\lambda_j \ge c\tau\sqrt{\log p/n}$ uniformly in $j$ for some $\tau, c > 0$. Then for $\hat\sigma_{ij}^2 := \hat\Theta_{ii}\hat\Theta_{jj} + \hat\Theta_{ij}^2$ we have

$$\sup_{\Theta_0 \in \mathcal{G}(s)} P\left(\max_{i,j = 1, \dots, p} |\hat\sigma_{ij}^2 - \sigma_{ij}^2| \ge C_\tau\sqrt{s \log p/n}\right) \le c_1 p^{\,\cdots}$$

for some constants $C_\tau, c_1, c_2 > 0$.

Lemma 2 implies that under $s = o(\sqrt{n}/\log p)$, we have a rate $|\hat\sigma_{ij}^2 - \sigma_{ij}^2| = o_P(1/n^{1/4})$. If Gaussianity is not assumed, we may replace the estimator of the variance with the empirical version, and plug in $\hat\Theta$ in place of the unknown $\Theta_0$. Thus we take the following estimator of $\sigma_{ij}^2$, where $\hat\Theta$ is the nodewise regression estimator

$$\hat\sigma_{ij}^2 := \frac{1}{n}\sum_{k=1}^n \left(\hat\Theta_i^T X_k X_k^T \hat\Theta_j\right)^2 - \hat\Theta_{ij}^2. \qquad (10)$$

The following lemma justifies this procedure under A1, A2* and A3.

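Lemma 3. Suppose that the assumptions A2* and A3 are satisfied and for some $\epsilon > 0$, it holds that $\lim_{n \to \infty} \log^{\cdots}(p \vee n)/n^{1-\epsilon} = 0$. Let $\hat\Theta$ be the nodewise Lasso estimator and let $\lambda_j \ge c\tau\sqrt{\log p/n}$ uniformly in $j$ for some $\tau, c > 0$. Let $\hat\sigma_{ij}$ be the estimator defined in (10). Then for all $\eta > 0$

$$\lim_{n \to \infty} \sup_{\Theta_0 \in \mathcal{G}(s)} P\left(\max_{i,j = 1, \dots, p} |\hat\sigma_{ij}^2 - \sigma_{ij}^2| \ge \eta\right) = 0.$$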
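
The two variance estimators of this section can be sketched as follows. The Gaussian plug-in follows the formula in Lemma 2. The empirical version follows our reading of (10), namely the average of $(\hat\Theta_i^T X_k X_k^T \hat\Theta_j)^2$ over the rows $X_k$ minus $\hat\Theta_{ij}^2$; the display is damaged in the extracted text, so this form is an assumption. Function names and the toy data are hypothetical.

```python
def gaussian_variance(Theta, i, j):
    """Plug-in variance Theta_ii * Theta_jj + Theta_ij^2 for Gaussian design."""
    return Theta[i][i] * Theta[j][j] + Theta[i][j] ** 2


def empirical_variance(X, Theta, i, j):
    """Empirical second moment of (Theta_i^T X_k)(X_k^T Theta_j) over the rows X_k,
    minus Theta_ij^2 (our reading of the empirical estimator, an assumption)."""
    n, p = len(X), len(Theta)
    second_moment = 0.0
    for row in X:
        a = sum(Theta[m][i] * row[m] for m in range(p))  # Theta_i^T X_k
        b = sum(Theta[m][j] * row[m] for m in range(p))  # Theta_j^T X_k
        second_moment += (a * b) ** 2
    return second_moment / n - Theta[i][j] ** 2


# Toy input: identity estimate and four rows with entries +/- 1 (hypothetical).
Theta_toy = [[1.0, 0.0], [0.0, 1.0]]
X_toy = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
v_gauss = gaussian_variance(Theta_toy, 0, 1)
v_emp = empirical_variance(X_toy, Theta_toy, 0, 1)
```

The two estimators agree for the off-diagonal entry of this toy example but differ on the diagonal, since the Gaussian formula relies on Gaussian fourth moments while the toy rows have entries of constant modulus.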


3.2 Rates of convergence 


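The de-sparsified estimator achieves optimal rates of convergence in supremum norm. Observe first that for the nodewise regression estimator it holds by (7), Lemma 1 and Lemma 10 in the Supplementary material that

$$\hat\Theta_{ij} - \Theta^0_{ij} = \hat\Theta_i^T(\hat\Sigma\hat\Theta_j - e_j) + O_P\left(\max\left\{\frac{1}{\sqrt{n}},\ s\,\frac{\log p}{n}\right\}\right).$$

By Hölder's inequality and the KKT conditions it follows

$$|\hat\Theta_i^T(\hat\Sigma\hat\Theta_j - e_j)| \le \lambda_j\|\hat\Theta_i\|_1/\hat\tau_j^2.$$

Consequently, for the rates of convergence of the nodewise Lasso in supremum norm we find

$$\|\hat\Theta - \Theta_0\|_\infty = O_P\left(\max\left\{\max_{i,j}\lambda_j\|\hat\Theta_i\|_1/\hat\tau_j^2,\ s\,\frac{\log p}{n},\ \sqrt{\frac{\log p}{n}}\right\}\right).$$

De-sparsifying the estimator $\hat\Theta$ as in (8) removes the term involving $\lambda_j\|\hat\Theta_i\|_1/\hat\tau_j^2$ in the above rates.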

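Theorem 2 (Rates of convergence). Assume that A2 and A3 are satisfied. Let $\tau > 0$ and let $\hat T$ be the de-sparsified nodewise Lasso estimator with regularization parameters $\lambda_j \ge c\tau\sqrt{\log p/n}$ for some sufficiently large constant $c > 0$, uniformly in $j$. Then there exist constants $C_\tau, c_1, c_2 > 0$ such that

$$\sup_{\Theta_0 \in \mathcal{G}(s)} P\left(|\hat T_{ij} - \Theta^0_{ij}| \ge C_\tau\max\left\{\frac{1}{\sqrt{n}},\ s\,\frac{\log p}{n}\right\}\right) \le c_1 e^{-c_2\tau}$$

and

$$\sup_{\Theta_0 \in \mathcal{G}(s)} P\left(\|\hat T - \Theta_0\|_\infty \ge C_\tau\max\left\{\sqrt{\frac{\log p}{n}},\ s\,\frac{\log p}{n}\right\}\right) \le \cdots$$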
We compare the results of Theorem 2 with results on optimal rates of convergence derived for Gaussian graphical models in Ren et al. (2015). Suppose that the observations are Gaussian, i.e. $X_1, \dots, X_n \sim \mathcal{N}(0, \Sigma_0)$. For $s \le C_0 n/\log p$ for some $C_0 > 0$ and $p \ge s^\nu$ for some $\nu > 2$ it holds (see Ren et al. (2015))

$$\inf_{\hat\Theta_{ij}} \sup_{\Theta_0 \in \mathcal{G}(s)} P\left(|\hat\Theta_{ij} - \Theta^0_{ij}| \ge \max\left\{C_1 s\,\frac{\log p}{n},\ C_2\frac{1}{\sqrt{n}}\right\}\right) \ge c_1 > 0,$$

and

$$\inf_{\hat\Theta} \sup_{\Theta_0 \in \mathcal{G}(s)} P\left(\|\hat\Theta - \Theta_0\|_\infty \ge \max\left\{C_1' s\,\frac{\log p}{n},\ C_2'\sqrt{\frac{\log p}{n}}\right\}\right) \ge c_2 > 0,$$

where $C_1, C_2, C_1', C_2'$ are positive constants depending on $\nu$ and $C_0$ only. As follows from Theorem 2, the de-sparsified nodewise Lasso attains the lower bound on rates and thus is in this sense optimal (considering the Gaussian setting).


3.3 Other de-sparsified estimators 

The de-sparsification may work for other estimators of the precision matrix, provided that certain conditions are satisfied. This is formulated in Lemma 4 below. A particular example of interest is the square-root nodewise Lasso estimator, which will be discussed below. This estimator has the advantage that it is self-scaling in the variance, similarly to the square-root Lasso (Belloni et al., 2011) on which it is based.


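Lemma 4. Assume that for some estimator $\hat\Omega = (\hat\Omega_1, \dots, \hat\Omega_p)$ it holds

$$\max_{j = 1, \dots, p} \|\hat\Omega_j - \Theta_j^0\|_1 = O_P(s\sqrt{\log p/n}), \qquad \|\hat\Sigma\hat\Omega - I\|_\infty = O_P(\sqrt{\log p/n}). \qquad (11)$$

Then for $\hat T := \hat\Omega + \hat\Omega^T - \hat\Omega^T\hat\Sigma\hat\Omega$ it holds under A1, A2*, A3

$$\|\hat T - \Theta_0\|_\infty = O_P(\max\{s \log p/n,\ \sqrt{\log p/n}\}).$$

Moreover, $\sqrt{n}(\hat T_{ij} - \Theta^0_{ij})/\sigma_{ij} \rightsquigarrow \mathcal{N}(0, 1)$.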

We briefly consider nodewise regression with the square-root Lasso as an example. The square-root Lasso estimators may be defined via

$$\hat\gamma_j := \arg\min_{\gamma \in \mathbb{R}^{p-1}} \|X_j - X_{-j}\gamma\|_2/\sqrt{n} + 2\lambda_0\|\gamma\|_1,$$

for $j = 1, \dots, p$. Define $\hat\tau_j^2 := \|X_j - X_{-j}\hat\gamma_j\|_2^2/n$ and $\tilde\tau_j^2 := \hat\tau_j^2 + \lambda_0\hat\tau_j\|\hat\gamma_j\|_1$. The nodewise square-root Lasso is then given by $\hat\Theta_{j,\mathrm{sqrt}} := \hat\Gamma_j/\tilde\tau_j^2$, where

$$\hat\Gamma_j := (-\hat\gamma_{j,1}, \dots, -\hat\gamma_{j,j-1}, 1, -\hat\gamma_{j,j+1}, \dots, -\hat\gamma_{j,p})^T.$$

Note that compared to the nodewise Lasso, the difference lies in estimation of the partial correlations, where we used the square-root Lasso that "removes the square" from the squared loss. The Karush-Kuhn-Tucker conditions, similarly as in Lemma 11 in the Supplementary material, give

$$\hat\Sigma\hat\Theta_{j,\mathrm{sqrt}} - e_j - \lambda_0\frac{\hat\tau_j}{\tilde\tau_j^2}\hat Z_j = 0,$$

where $\hat Z_j := (\hat\kappa_{j,1}, \dots, \hat\kappa_{j,j-1}, 0, \hat\kappa_{j,j+1}, \dots, \hat\kappa_{j,p})^T$ and $\hat\kappa_{j,i}$ is the sub-differential of the function $\beta \mapsto \|\beta\|_1$ with respect to $\beta_i$, evaluated at $\hat\gamma_j$. The paper Belloni et al. (2011) further shows that the $\ell_1$-rates for the square-root Lasso satisfy condition (11). The de-sparsified estimator may then be defined in the same way as in (8). Then the conditions of Lemma 4 are satisfied and this implies that a de-sparsified nodewise square-root Lasso achieves the same rates as the de-sparsified nodewise Lasso and thus is also rate-optimal.


3.4 The thresholded estimator and variable selection 


The de-sparsified estimator can be used for variable selection without imposing irrepresentable conditions. Under mild conditions, the procedure leads to exact recovery of the coefficients that are sufficiently larger in absolute value than the noise level. The following corollary is implied by Theorem 2 and Lemmas 2 and 3.


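Corollary 1. Let $\hat\Theta$ be obtained using the nodewise Lasso and $\hat T$ be defined as in (8) with tuning parameters $\lambda_j \ge c\tau\sqrt{\log p/n}$ uniformly in $j$, for some $c, \tau > 0$. Assume that conditions A1, A2*, A3 and A4 are satisfied. Let $\hat\sigma_{ij}$, $i, j = 1, \dots, p$ be the estimator from Lemma 3 and assume that $\log^{\cdots}(p \vee n)/n^{1-\epsilon} = o(1)$ for some $\epsilon > 0$. Then there exists some constant $C_\tau > 0$ such that

$$\lim_{n \to \infty} P\left(\max_{i,j = 1, \dots, p} |\hat T_{ij} - \Theta^0_{ij}|/\hat\sigma_{ij} \ge C_\tau\sqrt{\log p/n}\right) = 0.$$

If, in addition, the rows of $X$ are $\mathcal{N}(0, \Sigma_0)$-distributed and $\hat\sigma_{ij}$, $i, j = 1, \dots, p$ is instead the estimator from Lemma 2, then there exist constants $c_1, c_2, C_\tau$ such that

$$P\left(\max_{i,j = 1, \dots, p} |\hat T_{ij} - \Theta^0_{ij}|/\hat\sigma_{ij} \ge C_\tau\sqrt{\log p/n}\right) \le \cdots$$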

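Corollary 1 implies that we may define the re-sparsified estimator

$$\hat T^{\mathrm{thresh}}_{ij} := \hat T_{ij}\,\mathbf{1}_{\{|\hat T_{ij}| > C_\tau\hat\sigma_{ij}\sqrt{\log p/n}\}},$$

where $\hat\sigma_{ij}$ is defined as in Corollary 1. Denote $\hat S^{\mathrm{thresh}} := \{(i, j) \in V \times V : \hat T^{\mathrm{thresh}}_{ij} \neq 0\}$. Denote $S_0^{\mathrm{act}} := \{(i, j) \in V \times V : |\Theta^0_{ij}| > 2C_\tau\hat\sigma_{ij}\sqrt{\log p/n}\}$. Then it follows directly from Corollary 1 that with high probability

$$S_0^{\mathrm{act}} \subset \hat S^{\mathrm{thresh}} \subset S_0.$$

The inclusion $S_0^{\mathrm{act}} \subset \hat S^{\mathrm{thresh}}$ represents that $\hat T^{\mathrm{thresh}}$ correctly identifies all the non-zero parameters which are above the noise level. The inclusion $\hat S^{\mathrm{thresh}} \subset S_0$ means that there are no false positives. If for all $(i, j) \in S_0$ it holds

$$|\Theta^0_{ij}| > 2C_\tau\hat\sigma_{ij}\sqrt{\log p/n}, \qquad (12)$$

then we have exact recovery, i.e. with high probability $\hat S^{\mathrm{thresh}} = S_0$.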
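
The thresholding step can be sketched as below. The constant `C` plays the role of $C_\tau$, whose value is not specified by the theory and has to be chosen by the user; the function names and the toy matrices are hypothetical.

```python
from math import log, sqrt


def threshold(T, sigma, n, C=1.0):
    """Keep T_ij only if |T_ij| > C * sigma_ij * sqrt(log p / n); otherwise set it to zero."""
    p = len(T)
    rate = sqrt(log(p) / n)
    return [[T[i][j] if abs(T[i][j]) > C * sigma[i][j] * rate else 0.0
             for j in range(p)] for i in range(p)]


def support(M):
    """Set of index pairs (i, j) with a non-zero entry."""
    return {(i, j) for i, row in enumerate(M) for j, v in enumerate(row) if v != 0.0}


# Hypothetical de-sparsified estimate and standard deviations for p = 3, n = 400.
T_toy = [[1.30, -0.60, 0.02],
         [-0.60, 1.30, -0.60],
         [0.02, -0.60, 1.30]]
sigma_toy = [[1.0] * 3 for _ in range(3)]
S_thresh = support(threshold(T_toy, sigma_toy, n=400))
```

Here the threshold is about 0.052, so the two entries equal to 0.02 are set to zero and the estimated support consists of the remaining seven index pairs.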
4 Further comparison to previous work


Closely related is the paper Jankova and van de Geer (2015), where asymptotically normal estimation of elements of the concentration matrix is considered based on the graphical Lasso. While the analysis follows the same principles, the estimation method used here does not require the irrepresentability condition (Ravikumar et al., 2008) that is assumed in Jankova and van de Geer (2015). Hence we are able to show that our results hold uniformly over the considered class. Furthermore, regarding the computational cost, our method uses Lasso regressions, which can be implemented using fast algorithms as in Efron et al. (2004). In comparison, the graphical Lasso method presents a more challenging computational problem; for more details see e.g. Mazumder and Hastie (2012).


Another related work is the paper Ren et al. (2015). This paper suggests an estimator for the precision matrix which is shown to have one-dimensional asymptotically normal components with asymptotic variance equal to $\Theta^0_{ii}\Theta^0_{jj} + (\Theta^0_{ij})^2$. The assumptions and results used in the paper are essentially identical with our assumptions and theoretical results in the present paper. However, there are some differences. The paper Ren et al. (2015) assumes Gaussianity of the underlying distribution, while we only require sub-Gaussianity of the margins. Another difference is in the construction of the estimators. Both approaches use regression to estimate the elements of the precision matrix, but the paper Ren et al. (2015) concentrates on estimation of the joint distribution of each pair of variables $(X_i, X_j)$ for $i, j = 1, \dots, p$. Thus it is computationally more intensive as it requires $O(ps)$ high-dimensional regressions (see Ren et al. (2015)), while our methodology only requires $O(p)$.


5 Simulation results 


In this section we report on the performance of our method on simulated data and provide a comparison to another methodology. The random sample $X_1, \dots, X_n$ satisfies $\mathbb{E}X_i = 0$, $\mathrm{var}(X_i) = \Theta_0^{-1}$, where the precision matrix $\Theta_0 = \text{five-diag}(\rho_0, \rho_1, \rho_2)$ is defined by

$$\Theta^0_{ij} = \begin{cases} \rho_0 & \text{if } i = j, \\ \rho_1 & \text{if } |i - j| = 1, \\ \rho_2 & \text{if } |i - j| = 2, \\ 0 & \text{otherwise.} \end{cases}$$

We consider the settings $S_1 = (\rho_0, \rho_1, \rho_2) = (1, 0.3, 0)$ and $S_2 = (\rho_0, \rho_1, \rho_2) = (1, 0.5, 0.3)$. The second setting $(1, 0.5, 0.3)$ is further adjusted by randomly perturbing each non-zero off-diagonal element of $\Theta_0$ by adding a realization from the uniform distribution on the interval $[-0.05, 0.05]$. We denote this new perturbed model by $(1, 0.5, 0.3)_U$. Hence the second precision matrix was chosen randomly. The sparsity assumption requires $s = o(\sqrt{n}/\log p)$. We have chosen the sample sizes for numerical experiments according to the sparsity assumption (for this purpose, we ignored possible constants in the sparsity restriction), i.e. $n > s^2\log^2 p$.
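
The simulation design can be reproduced with a short sketch. The helper names are hypothetical, and the sample size rule is taken literally as the smallest integer $n$ with $n > s^2\log^2 p$, where $s$ is the maximum number of non-zero entries in a row (diagonal included, which is our assumption about how $s$ is counted).

```python
from math import floor, log


def five_diag(p, rho0, rho1, rho2):
    """p x p matrix with rho0 on the diagonal and rho1, rho2 on the first and second off-diagonals."""
    values = {0: rho0, 1: rho1, 2: rho2}
    return [[values.get(abs(i - j), 0.0) for j in range(p)] for i in range(p)]


def max_row_sparsity(Theta):
    """Maximum number of non-zero entries in a row."""
    return max(sum(1 for v in row if v != 0.0) for row in Theta)


def min_sample_size(Theta):
    """Smallest integer n with n > s^2 * log(p)^2."""
    p = len(Theta)
    s = max_row_sparsity(Theta)
    return floor(s ** 2 * log(p) ** 2) + 1


Theta_S1 = five_diag(100, 1.0, 0.3, 0.0)
n_S1 = min_sample_size(Theta_S1)
```

For setting $S_1$ with $p = 100$ this gives $s = 3$ and $n = 191$, which agrees with the sample size listed next to $p = 100$ in the coverage table for the Gaussian setting.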
5.1 Asymptotic normality and confidence intervals for individual parameters


5.1.1 The Gaussian setting 


In this section, we consider normally-distributed observations, $X_i \sim \mathcal{N}(0, \Theta_0^{-1})$, for $i = 1, \dots, n$. In Figure 1 we display histograms of $\sqrt{n}(\hat T_{ij} - \Theta^0_{ij})/\hat\sigma_{ij}$ for $(i, j) \in \{(1, 1), (1, 2), (1, 3)\}$, where $\hat T$ is defined in (8) and the empirical variance $\hat\sigma_{ij}$ is estimated as suggested by Lemma 2. Superimposed is the density of $\mathcal{N}(0, 1)$.

Secondly, we investigate the properties of confidence intervals constructed using the
de-sparsified nodewise Lasso. For comparison, we also provide results using confidence
intervals based on the de-sparsified graphical Lasso introduced in Jankova and
van de Geer (2015). The coverage and length of the confidence interval were estimated
by their empirical versions,

$$\hat\alpha_{ij} := \mathbb{P}_N \mathbf{1}\{\Theta^0_{ij} \in I_{ij,\alpha}\} \quad \text{and} \quad \hat\ell_{ij} := \mathbb{P}_N\, 2\Phi^{-1}(1 - \alpha/2)\hat\sigma_{ij}/\sqrt{n},$$

respectively, using $N = 300$ random samples. For a set $A \subset V \times V$, we define
the average coverage over the set $A$ (and analogously the average length
$\text{avglength}_A$) as

$$\text{avgcov}_A := \frac{1}{|A|} \sum_{(i,j) \in A} \hat\alpha_{ij}.$$

We report average coverages over the sets $S_0$ and $S_0^c$. These are denoted by
$\text{avgcov}_{S_0}$ and $\text{avgcov}_{S_0^c}$, respectively. Similarly, we calculate
average lengths of confidence intervals for each parameter $\Theta^0_{ij}$ from
$N = 300$ iterations and report $\text{avglength}_{S_0}$ and $\text{avglength}_{S_0^c}$.
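
The empirical coverage and length described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the inputs (pairs of a point estimate and its estimated standard deviation, one pair per simulated sample) are assumed to have been computed beforehand.

```python
from statistics import NormalDist

def confidence_interval(t_hat, sigma_hat, n, alpha=0.05):
    """Two-sided interval t_hat -/+ Phi^{-1}(1 - alpha/2) * sigma_hat / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * sigma_hat / n ** 0.5
    return t_hat - half, t_hat + half

def coverage_and_length(estimates, theta0, n, alpha=0.05):
    """Empirical coverage and mean length for one entry (i, j).

    estimates: list of (t_hat, sigma_hat) pairs, one per simulated sample.
    theta0: true value of the entry.
    """
    hits, total_len = 0, 0.0
    for t_hat, sigma_hat in estimates:
        lo, hi = confidence_interval(t_hat, sigma_hat, n, alpha)
        hits += lo <= theta0 <= hi
        total_len += hi - lo
    m = len(estimates)
    return hits / m, total_len / m

def average_over(values):
    """Average of per-entry quantities over a set A of index pairs."""
    return sum(values) / len(values)

# Toy usage: two simulated estimates of an entry whose true value is 0.
cov, length = coverage_and_length([(0.0, 1.0), (1.0, 1.0)], theta0=0.0, n=100)
```

Averaging the per-entry coverages with `average_over` over $S_0$ or $S_0^c$ gives the quantities reported in the tables.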

The results of the simulations are shown in Tables 1 and 2. The target coverage level
is 95%. For both methods, the tuning parameters were chosen as follows (see Ren et al.
(2015)):

$$s = \sqrt{n}/\log p, \qquad B = \mathrm{qt}(1 - s/(2p),\, n - 1), \qquad \lambda = B/\sqrt{n - 1 + B^2}, \tag{13}$$

where $\mathrm{qt}(\beta, n - 1)$ denotes the $\beta$-quantile of a $t$-distribution with
$n - 1$ degrees of freedom.


5.1.2 A sub-Gaussian setting 

In this section, we consider a design matrix with rows having a sub-Gaussian
distribution other than the Gaussian distribution. Let $U := (U_1, \dots, U_n)$ be an
$n \times p$ matrix with jointly independent entries generated from a continuous uniform
distribution on the interval $[-\sqrt{3}, \sqrt{3}]$. Further consider a matrix
$\Theta_0 := \text{five-diag}(1, 0.3, 0)$ and let $\Sigma_0 := \Theta_0^{-1}$. Then we
define

$$X_i := \Sigma_0^{1/2} U_i,$$

for $i = 1, \dots, n$. Then the expectation of $X_i$ is zero, the covariance matrix of
$X_i$ is exactly $\Sigma_0$ and the precision matrix is $\Theta_0$. It follows by
Hoeffding's inequality that $X_i$ defined as above is sub-Gaussian with a universal
constant $K > 0$.
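
A small sketch of this construction is given below. It makes one substitution that should be stated plainly: instead of the symmetric square root $\Sigma_0^{1/2}$ it uses a Cholesky factor $L$ with $LL^T = \Sigma_0$, which yields the same mean and covariance for $LU_i$ because the entries of $U_i$ are independent with mean zero and variance $(2\sqrt{3})^2/12 = 1$. The function names and the toy covariance matrix are illustrative assumptions.

```python
import math
import random

def cholesky(a):
    """Lower-triangular L with L L^T = a, for a symmetric positive definite a."""
    p = len(a)
    low = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            acc = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(acc) if i == j else acc / low[j][j]
    return low

def sample_row(low, rng):
    """Draw one row L U_i with U_i uniform on [-sqrt(3), sqrt(3)]^p."""
    p = len(low)
    u = [rng.uniform(-math.sqrt(3), math.sqrt(3)) for _ in range(p)]
    return [sum(low[i][k] * u[k] for k in range(i + 1)) for i in range(p)]

sigma0 = [[2.0, 1.0], [1.0, 2.0]]  # toy covariance, not the paper's Sigma_0
low = cholesky(sigma0)
rng = random.Random(0)
rows = [sample_row(low, rng) for _ in range(5)]
```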


A further difference compared to the simulations in Section 5.1.1 is that we now 
Figure 1: Asymptotic normality in the Gaussian setting. Histograms of
$\sqrt{n}(\hat T_{ij} - \Theta^0_{ij})/\hat\sigma_{ij}$ for
$(i,j) \in \{(1,1), (1,2), (1,3)\}$. The sample size was $n = 500$ and the number of
parameters $p = 100$. The nodewise regression estimator was calculated 300 times. The
setting is $S_1 = (1, 0.3, 0)$.


Gaussian setting: Estimated coverage probabilities and lengths
Setting S_1 = (1, 0.3, 0)

                        S_0                   S_0^c
  p     n    Method     avgcov   avglength    avgcov   avglength
 100   191   D-S NW     0.945    0.302        0.963    0.262
             D-S GL     0.931    0.293        0.974    0.254
 200   253   D-S NW     0.947    0.267        0.963    0.232
             D-S GL     0.928    0.254        0.976    0.220
 300   293   D-S NW     0.949    0.238        0.965    0.220
             D-S GL     0.928    0.236        0.977    0.205
 400   324   D-S NW     0.948    0.246        0.965    0.230
             D-S GL     0.925    0.228        0.981    0.223

Table 1: Comparison of the de-sparsified nodewise Lasso (D-S NW) and the de-sparsified
graphical Lasso (D-S GL). The parameter $p$ takes values 100, 200, 300, 400 and the
corresponding values of $n$ are given by $n = \lceil s^2 \log^2 p \rceil$, where $s = 3$.
The regularization parameter was chosen as described in (13). The number of generated
random samples was $N = 300$.


estimate the variance of the de-sparsified estimator using the formula proposed in
(10) for sub-Gaussian settings:

$$\hat\sigma_{ij}^2 := \frac{1}{n} \sum_{k=1}^n (\hat\Theta_i^T X_k X_k^T \hat\Theta_j)^2 - \hat\Theta_{ij}^2, \tag{14}$$

where $\hat\Theta$ is the nodewise Lasso. The regularization parameters for the nodewise
Lasso are chosen in accordance with (13). Figure 2 again displays the histograms related
to several entries of the de-sparsified nodewise Lasso. Results related to the
constructed confidence intervals are summarized in Table 3. The results demonstrate that
the
Gaussian setting: Estimated coverage probabilities and lengths
Setting S_2 = (1, 0.5, 0.3)_U

                        S_0                   S_0^c
  p     n    Method     avgcov   avglength    avgcov   avglength
 100   531   D-S NW     0.896    0.164        0.975    0.146
             D-S GL     0.781    0.153        0.980    0.137
 200   702   D-S NW     0.868    0.142        0.976    0.126
             D-S GL     0.729    0.133        0.982    0.119
 300   814   D-S NW     0.863    0.131        0.976    0.117
             D-S GL     0.712    0.124        0.984    0.110
 400   898   D-S NW     0.859    0.125        0.976    0.111
             D-S GL     0.709    0.118        0.984    0.105

Table 2: Comparison of the de-sparsified nodewise Lasso (D-S NW) and the de-sparsified
graphical Lasso (D-S GL). The parameter $p$ takes values 100, 200, 300, 400 and the
corresponding values of $n$ are given by $n = \lceil s^2 \log^2 p \rceil$, where $s = 5$.
The regularization parameter was chosen as described in (13). The number of generated
random samples was $N = 300$.


de-sparsified nodewise Lasso performs relatively well even under this non-Gaussian 
setting. 
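
The two variance estimates used in this section, the Gaussian plug-in $\hat\Theta_{ii}\hat\Theta_{jj} + \hat\Theta_{ij}^2$ and the sub-Gaussian formula (14), can be sketched as below. The function names are hypothetical, and the columns of the estimated precision matrix are assumed to be available as plain lists; the toy data are for illustration only.

```python
def dot(u, v):
    """Inner product of two equally long lists."""
    return sum(a * b for a, b in zip(u, v))

def gaussian_variance(theta_ii, theta_jj, theta_ij):
    """Plug-in variance under normality: theta_ii * theta_jj + theta_ij^2."""
    return theta_ii * theta_jj + theta_ij ** 2

def subgaussian_variance(theta_i, theta_j, theta_ij, rows):
    """Formula (14): (1/n) sum_k (theta_i^T x_k x_k^T theta_j)^2 - theta_ij^2."""
    n = len(rows)
    second_moment = sum((dot(theta_i, x) * dot(theta_j, x)) ** 2 for x in rows) / n
    return second_moment - theta_ij ** 2

# Toy usage with four observations in dimension two.
rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 1.0]]
v_sub = subgaussian_variance([1.0, 0.0], [0.0, 1.0], 0.0, rows)
v_gauss = gaussian_variance(1.0, 1.0, 0.3)
```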

Figure 2: Asymptotic normality in the sub-Gaussian setting. Histograms of
$\sqrt{n}(\hat T_{ij} - \Theta^0_{ij})/\hat\sigma_{ij}$ for
$(i,j) \in \{(1,1), (1,2), (1,3)\}$. The sample size was $n = 500$ and the number of
parameters $p = 100$. The nodewise regression estimator was calculated 300 times. The
setting is $S_1 = (1, 0.3, 0)$.


5.2 Variable selection 


For variable selection as suggested in Corollary we compare the de-sparsified nodewise
Lasso and the de-sparsified graphical Lasso. The setting is again as in Section
5.1.1. Average true positives and false positives over 100 repetitions are reported.

Choice of the tuning parameters is according to (13) and the thresholding level is
Sub-Gaussian setting: Estimated coverage probabilities and lengths
Setting S_1 = (1, 0.3, 0)

                        S_0                   S_0^c
  p     n    Method     avgcov   avglength    avgcov   avglength
 100   191   D-S NW     0.906    0.234        0.949    0.249
             D-S GL     0.811    0.190        0.944    0.216
 200   253   D-S NW     0.909    0.203        0.950    0.217
             D-S GL     0.791    0.165        0.946    0.187
 300   293   D-S NW     0.911    0.189        0.950    0.202
             D-S GL     0.765    0.152        0.947    0.173
 400   324   D-S NW     0.911    0.180        0.951    0.192
             D-S GL     0.740    0.143        0.947    0.164

Table 3: Comparison of the de-sparsified nodewise Lasso (D-S NW) and the de-sparsified
graphical Lasso (D-S GL). The parameter $p$ takes values 100, 200, 300, 400 and the
corresponding values of $n$ are given by $n = \lceil s^2 \log^2 p \rceil$, where $s = 3$.
The regularization parameter was chosen as described in (13). The number of generated
random samples was $N = 300$.


given by

$$\lambda_{\mathrm{thresh}} = \ldots \tag{15}$$

taking $\nu = 1$ for the de-sparsified nodewise regression and $\nu = 0.5$ for the
de-sparsified graphical Lasso. We take
$\hat\sigma_{ij}^2 = \hat\Theta_{ii}\hat\Theta_{jj} + \hat\Theta_{ij}^2$ as in Lemma 2.
The results of this simulation experiment are summarized in Table 4.
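
The counting of true and false positives after thresholding can be sketched as follows. This is an illustration under assumptions: the threshold is passed in as a matrix of entrywise levels rather than computed from (15), the function names are hypothetical, and the diagonal is included in the support, which matches the support sizes $|S_0| = 5p - 6$ reported in Table 4.

```python
def select_entries(t_hat, thresh):
    """Index pairs (i, j) whose estimate exceeds its threshold in absolute value."""
    p = len(t_hat)
    return {(i, j) for i in range(p) for j in range(p)
            if abs(t_hat[i][j]) > thresh[i][j]}

def true_false_positives(selected, support):
    """Count selected entries inside (TP) and outside (FP) the true support."""
    tp = len(selected & support)
    fp = len(selected - support)
    return tp, fp

# Toy usage on a 3 x 3 estimate with a tridiagonal true support.
t_hat = [[1.00, 0.40, 0.01],
         [0.40, 1.00, 0.02],
         [0.01, 0.02, 1.00]]
thresh = [[0.1] * 3 for _ in range(3)]
support = {(i, j) for i in range(3) for j in range(3) if abs(i - j) <= 1}
tp, fp = true_false_positives(select_entries(t_hat, thresh), support)
```

In this toy example the weak entries at positions (1, 2) and (2, 1) fall below the threshold and are missed, so the true positive count is below the support size.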


Estimated true positives (TP) and false positives (FP)
Setting S_1 = (1, 0.5, 0.4)

  p     |S_0|   Method     TP         TP rate %   FP   FP rate %
 100     494    D-S NW      494       100.0       0    0
                D-S GL      493.98     99.999     0    0
 200     994    D-S NW      994       100.0       0    0
                D-S GL      993.62     99.961     0    0
 300    1494    D-S NW     1494       100.0       0    0
                D-S GL     1492.42     99.894     0    0
 400    1994    D-S NW     1994.00    100.0       0    0
                D-S GL     1989.08     99.753     0    0

Table 4: Estimated true positives (TP) and false positives (FP) for the de-sparsified
nodewise regression estimator (D-S NW) and for the D-S GL estimator. The sample size
$n = 400$ was held constant for all the values of $p$; the number of repetitions was
$N = 100$. The thresholding level was chosen as in (15).


6 Real data experiments 

We consider two real datasets, where we model the conditional independence structure
of the covariates using a graphical model. In particular, we aim to do edge selection
and we estimate the edge structure of the graphical model using the de-sparsified
nodewise Lasso. The first dataset is the Prostate Tumor Gene Expression dataset,
which is available in the R package spls. The second dataset is about riboflavin
(vitamin B2) production by Bacillus subtilis. The dataset is available from the R
package hdi.

For both datasets, the procedure is essentially identical. We only consider the 500
covariates with the highest variances. In the first step, we split the sample and
use 10 randomly chosen observations to estimate the variances of the 500 variables.
With the estimated variances, we scale the design matrix containing the remaining
observations. We calculate the nodewise Lasso using the tuning parameter as in the
simulation study, and then calculate the de-sparsified nodewise Lasso. We threshold
the de-sparsified nodewise Lasso at the level
$\Phi^{-1}(1 - \alpha/(2p^2))\hat\sigma_{ij}/\sqrt{n}$, where $\alpha = 0.05$ and
$\hat\sigma_{ij}^2 = \hat\Theta_{ii}\hat\Theta_{jj} + \hat\Theta_{ij}^2$ is an estimate
of the asymptotic variance calculated under the assumption of normality and using the
nodewise Lasso estimator $\hat\Theta$.
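
The thresholding level used for edge selection can be computed with the standard library alone. The sketch below is illustrative: the function name `edge_threshold` is hypothetical, $\hat\sigma_{ij} = 1$ is a placeholder value, and $n = 61$ is an assumed example of the number of observations left after setting 10 aside for variance estimation in a sample of 71.

```python
import math
from statistics import NormalDist

def edge_threshold(sigma_hat, n, p, alpha=0.05):
    """Level Phi^{-1}(1 - alpha / (2 p^2)) * sigma_hat / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - alpha / (2 * p ** 2))
    return z * sigma_hat / math.sqrt(n)

p = 500
n_edges = math.comb(p, 2)  # number of edges in the full graph on 500 nodes
level = edge_threshold(sigma_hat=1.0, n=61, p=p)
```

Dividing $\alpha$ by $2p^2$ makes the normal quantile considerably larger than the usual 1.96, which is what guards against false edges when many entries are tested at once.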

The riboflavin dataset contained observations of $p = 4088$ logarithms of gene
expression levels from $n = 71$ genetically engineered mutants of Bacillus subtilis. We
considered the 500 variables with the highest variances, hence a full graph contains
$\binom{500}{2}$ edges. The de-sparsified nodewise Lasso identified 20 edges as
significant. For comparison, the de-sparsified graphical Lasso introduced in Jankova
and van de Geer (2015) identified 5 edges as significant. It is worth pointing out that
the set of edges selected by the de-sparsified graphical Lasso is a subset of the edges
selected by the de-sparsified nodewise Lasso.

The prostate dataset contained $n = 102$ observations on $p = 6033$ variables. We used
the procedure above to do edge selection using the de-sparsified nodewise Lasso, which
identified 108 edges as significant. For comparison, the de-sparsified graphical Lasso
identified 28 edges as significant. Again, the set of edges selected by the
de-sparsified graphical Lasso is a subset of the edges selected by the de-sparsified
nodewise Lasso.


7 Conclusions 

We proposed a methodology for low-dimensional inference in high-dimensional graphical
models. The method, called the de-sparsified nodewise Lasso, is easy to implement
and computationally competitive with state-of-the-art methods. We studied asymptotic
properties of the de-sparsified nodewise Lasso under mild conditions on the
model. The de-sparsified nodewise Lasso enjoys rate optimality in supremum norm
and leads to exact variable selection under beta-min conditions and mild conditions
on the model. We demonstrated its performance on several models in a simulation
study and on two real datasets. These numerical studies showed that it performs
well in a variety of settings, including non-Gaussian settings. Open questions
include, for instance, the asymptotic efficiency of the proposed estimator, in analogy
with the low-dimensional setting.
Supplementary material

We summarize some preliminary material in Section 8 and proofs of all the results in
Section 9. In Appendices A and B we summarize some known results.


8 Preliminary results: Rates of convergence of the nodewise Lasso


We provide a brief overview of the results for the nodewise Lasso (derived in van
de Geer et al. (2013)) in the following two lemmas. Lemma 5 below follows from
classical concentration results for sub-Gaussian random variables (see Bühlmann and
van de Geer (2011)). The proof of Lemma 6 below can be found in van de Geer et al.
(2013).

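Lemma 5. Suppose that $X_j$, $j = 1, \dots, n$, with $\mathbb{E}X_j = 0$ and
$\mathrm{var}(X_j) = \Sigma_0$ satisfy the eigenvalue condition A1 and the
sub-Gaussianity condition A3. Then there exist constants $c_1, c_2, C > 0$ such that for
any $\tau$ sufficiently large

$$P(T_j^c) \le c_1 p^{-c_2\tau},$$

where

$$T_j := \{\|\hat\Sigma\Theta_j^0 - e_j\|_\infty \le C\tau\sqrt{\log p/n},\ \ \|\hat\Sigma - \Sigma_0\|_\infty \le C\tau\sqrt{\log p/n},\ \ |\eta_j^T\eta_j/n - \tau_j^2| \le C\tau\sqrt{\log p/n}\}.$$

Moreover, for the set $\bigcap_{j=1}^p T_j$ we get by the union bound

$$P\Big(\big(\textstyle\bigcap_{j=1}^p T_j\big)^c\Big) = P\Big(\textstyle\bigcup_{j=1}^p T_j^c\Big) \le p \max_{j=1,\dots,p} P(T_j^c) \le c_1 p^{1 - c_2\tau}.$$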


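Lemma 6 (a version of Theorem 2.4 in van de Geer et al. (2013)). Let $\tau > 0$ and
suppose that $\hat\Theta$ is the nodewise Lasso estimator with regularization parameters
$\lambda_j \ge c\tau\sqrt{\log p/n}$ uniformly in $j$, for some sufficiently large
constant $c > 0$. Suppose that A1, A2, A3 are satisfied. Then there exists a constant
$C_\tau > 0$ such that on the set $T_j$ defined in Lemma 5 it holds

$$\|\hat\Theta_j - \Theta_j^0\|_1 \le C_\tau s\sqrt{\log p/n}, \qquad |\hat\tau_j^2 - \tau_j^2| \le C_\tau\sqrt{s\log p/n},$$

$$\|\hat\Theta_j - \Theta_j^0\|_2^2 \le C_\tau s\log p/n.$$

Furthermore, on the set $\bigcap_{j=1}^p T_j$ we have

$$\max_{j=1,\dots,p}\|\hat\Theta_j - \Theta_j^0\|_1 \le C_\tau s\sqrt{\log p/n}, \qquad \max_{j=1,\dots,p}|\hat\tau_j^2 - \tau_j^2| \le C_\tau\sqrt{s\log p/n},$$

$$\max_{j=1,\dots,p}\|\hat\Theta_j - \Theta_j^0\|_2^2 \le C_\tau s\log p/n.$$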
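9 Proofs for Section 3


Proof of Lemma 1. We first derive a bound for $|\Delta_{ij}|$ (which will be useful
later) and then a bound for $\|\Delta\|_\infty$.

Let $\tau > 0$ and consider the set $T_j$ defined in Lemma 5. Note that by Lemma 5
(and since $\|\Theta_i^0\|_2 \le L$, $\|\Theta_j^0\|_2 \le L$) we have
$P((T_i \cap T_j)^c) \le c_1 p^{-c_2\tau}$ for some constants $c_1, c_2 > 0$. We now
condition on the set $T_i \cap T_j$.

By the definition of $\hat T$, we have

$$\hat T_{ij} - \Theta^0_{ij} = \hat\Theta_{ij} - \Theta^0_{ij} - \hat\Theta_i^T(\hat\Sigma\hat\Theta_j - e_j) = -(\Theta_i^0)^T(\hat\Sigma - \Sigma_0)\Theta_j^0 + \underbrace{(\hat\Sigma\Theta_i^0 - e_i)^T(\Theta_j^0 - \hat\Theta_j)}_{\mathrm{rem}_1} + \underbrace{(\Theta_i^0 - \hat\Theta_i)^T(\hat\Sigma\hat\Theta_j - e_j)}_{\mathrm{rem}_2}.$$

For the first remainder, we obtain

$$|\mathrm{rem}_1| = |(\hat\Sigma\Theta_i^0 - e_i)^T(\Theta_j^0 - \hat\Theta_j)| \le \|\hat\Sigma\Theta_i^0 - e_i\|_\infty \|\Theta_j^0 - \hat\Theta_j\|_1.$$

Under A1 and A3, by Lemma 10 in Appendix A we have

$$\max_{i=1,\dots,p}\|\hat\Sigma\Theta_i^0 - e_i\|_\infty = O_P(\sqrt{\log p/n}).$$

Consequently, and using the $\ell_1$-rates of convergence from Lemma 6, we obtain for
some constant $C_\tau > 0$

$$|\mathrm{rem}_1| \le C_\tau s\log p/n.$$

By Lemma 6 we have $|\hat\tau_j^2 - \tau_j^2| \le C_\tau\sqrt{s\log p/n}$, and since
$1/\tau_j^2 = O(1)$, we have $1/\hat\tau_j^2 = O(C_\tau)$. Hence for the second
remainder, by the KKT conditions and Lemma 6 we obtain

$$|\mathrm{rem}_2| = |(\Theta_i^0 - \hat\Theta_i)^T(\hat\Sigma\hat\Theta_j - e_j)| \le \|\hat\Sigma\hat\Theta_j - e_j\|_\infty\|\Theta_i^0 - \hat\Theta_i\|_1 \le \lambda_j/\hat\tau_j^2\,(1 + \|\hat\kappa_j\|_1)\, s\sqrt{\log p/n} = O(C_\tau s\log p/n).$$

Therefore, for $\Delta_{ij} := \mathrm{rem}_1 + \mathrm{rem}_2$, we have
$|\Delta_{ij}| = O(s\log p/n)$ on the set $T_i \cap T_j$.

Now condition on the set $\bigcap_{j=1}^p T_j$. By Lemma 5, for some constants
$c_1, c_2 > 0$ it holds

$$P\Big(\big(\textstyle\bigcap_{j=1}^p T_j\big)^c\Big) \le c_1 p^{1 - c_2\tau}.$$

We again have the decomposition

$$\hat T - \Theta_0 = -\Theta_0(\hat\Sigma - \Sigma_0)\Theta_0 + \underbrace{(\Theta_0\hat\Sigma - I)(\Theta_0 - \hat\Theta) + (\Theta_0 - \hat\Theta)^T(\hat\Sigma\hat\Theta - I)}_{\Delta}. \tag{16}$$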
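But then we obtain

$$\|\Delta\|_\infty \le \|(\Theta_0\hat\Sigma - I)(\Theta_0 - \hat\Theta)\|_\infty + \|(\Theta_0 - \hat\Theta)^T(\hat\Sigma\hat\Theta - I)\|_\infty$$

$$\le \|\Theta_0\hat\Sigma - I\|_\infty\, |||\Theta_0 - \hat\Theta|||_1 + |||\Theta_0 - \hat\Theta|||_1\, \|\hat\Sigma\hat\Theta - I\|_\infty$$

$$\le \max_{j=1,\dots,p}\|\hat\Sigma\Theta_j^0 - e_j\|_\infty\, |||\Theta_0 - \hat\Theta|||_1 + |||\Theta_0 - \hat\Theta|||_1 \max_{j=1,\dots,p}\lambda_j/\hat\tau_j^2.$$

By Lemma 6 we have $|||\Theta_0 - \hat\Theta|||_1 = O(C_\tau s\sqrt{\log p/n})$ and
$\max_{j=1,\dots,p}|\hat\tau_j^2 - \tau_j^2| = O(C_\tau\max_{j=1,\dots,p}\sqrt{s}\lambda_j)$.
Hence, and by the last display we get

$$\|\Delta\|_\infty = O(C_\tau s\log p/n), \quad \text{and therefore} \quad \sqrt{n}\,\|\Delta\|_\infty = O(C_\tau s\log p/\sqrt{n}).$$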


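Therefore for some constant $C_\tau > 0$ we obtain

$$P(\|\Delta\|_\infty \ge C_\tau s\log p/\sqrt{n}) \le P\Big(\big(\textstyle\bigcap_{j=1}^p T_j\big)^c\Big) \le c_1 p^{1 - c_2\tau}.$$

Now note that the constants $c_1, c_2$ are universal, therefore we can take the supremum
over $\mathcal{G}(s)$ to obtain

$$\sup_{\Theta_0 \in \mathcal{G}(s)} P(\|\Delta\|_\infty \ge C_\tau s\log p/\sqrt{n}) \le c_1 p^{1 - c_2\tau}.$$

If we choose $\tau > 0$ sufficiently large so that $1 - c_2\tau < 0$ then

$$\lim_{n\to\infty}\sup_{\Theta_0 \in \mathcal{G}(s)} P(\|\Delta\|_\infty \ge C_\tau s\log p/\sqrt{n}) \le \lim_{n\to\infty} c_1 p^{1 - c_2\tau} = 0.$$

□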


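Proof of Theorem 1. By (16), it holds

$$\sqrt{n}(\hat T_{ij} - \Theta^0_{ij})/\sigma_{ij} = -Z/\sigma_{ij} + \mathrm{rem}/\sigma_{ij},$$

where

$$Z := \sqrt{n}(\Theta_i^0)^T(\hat\Sigma - \Sigma_0)\Theta_j^0 =: \frac{1}{\sqrt{n}}\sum_{k=1}^n \big((\Theta_i^0)^T X_k X_k^T \Theta_j^0 - \Theta^0_{ij}\big)$$

and
$\mathrm{rem} := \sqrt{n}(\hat\Sigma\Theta_i^0 - e_i)^T(\Theta_j^0 - \hat\Theta_j) + \sqrt{n}(\Theta_i^0 - \hat\Theta_i)^T(\hat\Sigma\hat\Theta_j - e_j)$.
First observe that by Lemma 1, $|\mathrm{rem}| = O_P(s\log p/\sqrt{n}) = o_P(1)$. Note
that under A3 the results of Lemma 1 are uniform in $\Theta_0$.

Denote $Z_k := (\Theta_i^0)^T X_k X_k^T \Theta_j^0 - \Theta^0_{ij}$. To show
$Z/\sigma_{ij} \rightsquigarrow \mathcal{N}(0,1)$, we apply the Berry-Esseen theorem (see
e.g. Durrett (2010)). By Lemma 8 in Appendix A (note that $\|\Theta_i^0\|_2 \le L$,
$\|\Theta_j^0\|_2 \le L$), we have a moment bound for $m \ge 2$,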


nZkr/{2aoL^Kr < "^{Klaor-^ 
for some constant $\sigma_0 > 0$. Under A1, for every fixed $m$ we have
$\sup_{\Theta_0}\mathbb{E}|Z_k|^m = O(1)$.

We have $\sigma_{ij}^2 = \mathbb{E}(Z_k^2) \ge \omega$ for some universal constant
$\omega > 0$. Therefore,

$$|P_{\Theta_0}(Z/\sigma_{ij} \le z) - \Phi(z)| \le \frac{3\,\mathbb{E}|Z_1|^3}{(\mathbb{E}|Z_1|^2)^{3/2}\sqrt{n}} \le \frac{C}{\sqrt{n}},$$

where $C$ does not depend on $\Theta_0$ and $n$, which concludes the proof.

□


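9.1 Proofs for Section 3.1


Proof of Lemma 2. Since $X_i \sim \mathcal{N}(0, \Sigma_0)$, then
$Z := \Theta_0 X_i \sim \mathcal{N}(0, \Theta_0)$. It is well known that
$\mathbb{E}(Z_i^2 Z_j^2) = \Theta^0_{ii}\Theta^0_{jj} + 2(\Theta^0_{ij})^2$. Consequently,
we get

$$\sigma_{ij}^2 = \mathrm{var}\big((\Theta_i^0)^T X_1 (\Theta_j^0)^T X_1\big) = \mathrm{var}\big((\Theta_0 X_1)_i(\Theta_0 X_1)_j\big) = \Theta^0_{ii}\Theta^0_{jj} + (\Theta^0_{ij})^2.$$

We next use the results of Lemmas 5 and 6. The set $\bigcap_{j=1}^p T_j$ from Lemma 5
holds with probability at least $1 - c_1 p^{1 - c_2\tau}$ for some constants
$c_1, c_2 > 0$. We define $\Delta_{ij} := \hat\Theta_{ij} - \Theta^0_{ij}$. Conditioning
on the set $\bigcap_{j=1}^p T_j$ we thus get (using Lemma 6 (Euclidean norm bound) and
condition A1)

$$\max_{i,j=1,\dots,p}|\hat\sigma_{ij}^2 - \sigma_{ij}^2| \le \max_{i,j=1,\dots,p}|\hat\Theta_{ii}\hat\Theta_{jj} - \Theta^0_{ii}\Theta^0_{jj}| + \max_{i,j=1,\dots,p}|\hat\Theta_{ij}^2 - (\Theta^0_{ij})^2|$$

$$\le \max_{i,j=1,\dots,p}|\Delta_{ii}\Delta_{jj} + \Theta^0_{ii}\Delta_{jj} + \Theta^0_{jj}\Delta_{ii}| + \max_{i,j=1,\dots,p}|\Delta_{ij}(\Delta_{ij} + 2\Theta^0_{ij})|$$

$$\le C_\tau\sqrt{s\log p/n}, \tag{17}$$

for some constant $C_\tau > 0$.

□

Proof of Lemma 3. To simplify notation, we write $\Theta := \Theta_0$. We have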


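$$\max_{i,j=1,\dots,p}|\hat\sigma_{ij}^2 - \sigma_{ij}^2| = \max_{i,j=1,\dots,p}\Big|\frac{1}{n}\sum_{k=1}^n(\hat\Theta_i^T X_k X_k^T\hat\Theta_j)^2 - \hat\Theta_{ij}^2 - \big\{\mathbb{E}(\Theta_i^T X_1 X_1^T\Theta_j)^2 - \Theta_{ij}^2\big\}\Big|$$

$$\le \max_{i,j=1,\dots,p}\underbrace{\Big|\frac{1}{n}\sum_{k=1}^n(\hat\Theta_i^T X_k X_k^T\hat\Theta_j)^2 - \mathbb{E}(\Theta_i^T X_1 X_1^T\Theta_j)^2\Big|}_{I} + \max_{i,j=1,\dots,p}\underbrace{|\Theta_{ij}^2 - \hat\Theta_{ij}^2|}_{II}.$$

By A1, $|\Theta_{ij}| = O(1)$ and by Lemma 6 we have

$$\max_{i=1,\dots,p}\|\hat\Theta_i - \Theta_i\|_2 = O_P(\sqrt{s\log p/n}).$$

Hence

$$\max_{i,j=1,\dots,p}|II| \le 2\max_{i,j=1,\dots,p}|\Theta_{ij}||\hat\Theta_{ij} - \Theta_{ij}| + |\hat\Theta_{ij} - \Theta_{ij}|^2 = O_P(\sqrt{s\log p/n}).$$

For the first term, we have

$$\max_{i,j=1,\dots,p}|I| = \max_{i,j=1,\dots,p}\Big|\frac{1}{n}\sum_{k=1}^n(\hat\Theta_i^T X_k X_k^T\hat\Theta_j)^2 - \mathbb{E}(\Theta_i^T X_1 X_1^T\Theta_j)^2\Big|$$

$$\le \max_{i,j=1,\dots,p}\underbrace{\Big|\frac{1}{n}\sum_{k=1}^n(\hat\Theta_i^T X_k X_k^T\hat\Theta_j)^2 - (\Theta_i^T X_k X_k^T\Theta_j)^2\Big|}_{\mathrm{rem}_1} + \max_{i,j=1,\dots,p}\underbrace{\Big|\frac{1}{n}\sum_{k=1}^n(\Theta_i^T X_k X_k^T\Theta_j)^2 - \mathbb{E}(\Theta_i^T X_k X_k^T\Theta_j)^2\Big|}_{\mathrm{rem}_2}.$$

By symmetrization,

$$\mathbb{E}|\mathrm{rem}_2| \le 2\,\mathbb{E}\max_{i,j=1,\dots,p}\Big|\frac{1}{n}\sum_{k=1}^n(\Theta_i^T X_k X_k^T\Theta_j)^2\epsilon_k\Big|,$$

where $\epsilon_k$, $k = 1, \dots, n$, is a sequence of independent Rademacher random
variables, independent of $X$. Let $K > 0$ and consider truncation by $K$ as follows

$$\mathbb{E}|\mathrm{rem}_2| \le 2\,\mathbb{E}\underbrace{\max_{i,j=1,\dots,p}\Big|\frac{1}{n}\sum_{k=1}^n(\Theta_i^T X_k X_k^T\Theta_j)^2\epsilon_k\mathbf{1}_{(\Theta_i^T X_k X_k^T\Theta_j)^2 \le K}\Big|}_{r_1} + 2\,\mathbb{E}\underbrace{\max_{i,j=1,\dots,p}\Big|\frac{1}{n}\sum_{k=1}^n(\Theta_i^T X_k X_k^T\Theta_j)^2\epsilon_k\mathbf{1}_{(\Theta_i^T X_k X_k^T\Theta_j)^2 > K}\Big|}_{r_2}.$$

The term $r_1$ can be bounded using Hoeffding's inequality since for $i, j = 1, \dots, p$
and $k = 1, \dots, n$ it holds

$$\big|(\Theta_i^T X_k X_k^T\Theta_j)^2\epsilon_k\mathbf{1}_{(\Theta_i^T X_k X_k^T\Theta_j)^2 \le K}\big| \le K.$$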

Thus for any a > 0 


1 

i XkXlQjf eklfQTx^xlej)^<K\ > «) 


Hence integrating over a, we get for the expectation 

Eri = / E(ri > a)da < —- 1 = 

Jo ^ yn 
Now we bound $\mathbb{E}r_2$. Rewriting the expectation and using the union bound, we
obtain

$$\mathbb{E}r_2 = \int_0^\infty P(r_2 > x)\,dx \le p^2 n\int_0^\infty \max_{i,j=1,\dots,p}\ \max_{k=1,\dots,n} P\big((\Theta_i^T X_k X_k^T\Theta_j)^2 > \max(x, K)\big)\,dx.$$

Next by the sub-Gaussianity condition A3 we have

$$P\big((\Theta_i^T X_k X_k^T\Theta_j)^2 > \max(x, K)\big) \le c_1 e^{-c_2\max(x,K)^{1/2}}$$

for some constants $c_1, c_2 > 0$. Hence

$$\mathbb{E}r_2 \le p^2 n\int_0^\infty c_1 e^{-c_2\max(x,K)^{1/2}}\,dx = p^2 n c_1 K e^{-c_2\sqrt{K}} + p^2 n c_1\frac{2}{c_2}\Big(\sqrt{K} + \frac{1}{c_2}\Big)e^{-c_2\sqrt{K}}.$$

We choose $K = n^{(1-\epsilon)/2}$, where $\epsilon$ is taken from the assumptions of this
lemma. Then we get


E|rem2| < Eri + Er2 

+ p^ncx— 

C 2 V C 2 / 

Then by the assumption $\log(p \vee n)/n^{(1-\epsilon)/4} = o(1)$, we obtain that the
above bound converges to zero for $n \to \infty$. This implies that
$|\mathrm{rem}_2| = o_P(1)$.

To bound $\max_{i,j=1,\dots,p}|\mathrm{rem}_1|$, we denote
$\hat W_{i,k} := \hat\Theta_i^T X_k$ and $W_{i,k} := \Theta_i^T X_k$ for
$i = 1, \dots, p$ and $k = 1, \dots, n$. Then we can rewrite

$$\mathrm{rem}_1 = \Big|\frac{1}{n}\sum_{k=1}^n(\hat W_{i,k}\hat W_{j,k})^2 - (W_{i,k}W_{j,k})^2\Big|.$$

Observe that

$$\mathrm{rem}_1 = \Big|\frac{1}{n}\sum_{k=1}^n\big\{(\hat W_{i,k} - W_{i,k})(\hat W_{j,k} - W_{j,k}) + (\hat W_{i,k} - W_{i,k})W_{j,k} + W_{i,k}(\hat W_{j,k} - W_{j,k}) + W_{i,k}W_{j,k}\big\}^2 - (W_{i,k}W_{j,k})^2\Big|.$$

To bound $|\mathrm{rem}_1|$, we make use of the last expression where we expand the
curly bracket $\{\cdot\}^2$. We have

$$\mathrm{rem}_1 = \Big|\frac{1}{n}\sum_{k=1}^n (\hat W_{i,k} - W_{i,k})^2(\hat W_{j,k} - W_{j,k})^2 + 2(\hat W_{i,k} - W_{i,k})^2(\hat W_{j,k} - W_{j,k})W_{j,k} + 2(\hat W_{i,k} - W_{i,k})(\hat W_{j,k} - W_{j,k})^2 W_{i,k}$$

$$\qquad + 4(\hat W_{i,k} - W_{i,k})(\hat W_{j,k} - W_{j,k})W_{i,k}W_{j,k} + (\hat W_{i,k} - W_{i,k})^2 W_{j,k}^2 + 2(\hat W_{i,k} - W_{i,k})W_{j,k}^2 W_{i,k}$$

$$\qquad + W_{i,k}^2(\hat W_{j,k} - W_{j,k})^2 + 2W_{i,k}^2(\hat W_{j,k} - W_{j,k})W_{j,k}\Big|.$$

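By (iteratively) applying the Cauchy-Schwarz (C-S) inequality to each summation
term in the last display, we can bound each term by a term involving
$\frac{1}{n}\sum_{k=1}^n(\hat W_{i,k} - W_{i,k})^4$ and $\frac{1}{n}\sum_{k=1}^n W_{i,k}^4$,
$i = 1, \dots, p$. Consequently, it suffices to find a rate for

$$\max_{i=1,\dots,p}\frac{1}{n}\sum_{k=1}^n(\hat W_{i,k} - W_{i,k})^4$$

and to show that $\max_{i=1,\dots,p}\frac{1}{n}\sum_{k=1}^n W_{i,k}^4 = O_P(1)$. Then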


max |remi 


Op{ max (- > (Wi^k - 


k=l 




We now show that

$$\max_{i=1,\dots,p}\frac{1}{n}\sum_{k=1}^n(\hat W_{i,k} - W_{i,k})^4 = O_P\big((s\log p/\sqrt{n})^2\big) \tag{18}$$

and that

$$\max_{i=1,\dots,p}\frac{1}{n}\sum_{k=1}^n W_{i,k}^4 = O_P(1). \tag{19}$$

The claim (18) follows by Lemma 7 below. We now show (19). When bounding
$\max_{i,j=1,\dots,p}|\mathrm{rem}_2|$ above, we have shown that for all $a > 0$

$$\lim_{n\to\infty} P\Big(\max_{i,j=1,\dots,p}\Big|\frac{1}{n}\sum_{k=1}^n W_{i,k}^2 W_{j,k}^2 - \mathbb{E}W_{i,k}^2 W_{j,k}^2\Big| > a\Big) = 0.$$

But then since $\mathbb{E}W_{i,k}^2 W_{j,k}^2 = O(1)$, it follows that also (note here
that the case when $i = j$ is covered)

$$\lim_{n\to\infty} P\Big(\max_{i,j=1,\dots,p}\frac{1}{n}\sum_{k=1}^n W_{i,k}^2 W_{j,k}^2 > C_2\Big) = 0,$$

for some constant $C_2 > 0$.

Collecting all the results above, for any $\eta > 0$ we have

$$\lim_{n\to\infty} P\Big(\max_{i,j=1,\dots,p}|\hat\sigma_{ij}^2 - \sigma_{ij}^2| > \eta\Big) = 0.$$

□


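Lemma 7. Under the assumptions of Lemma 3 (and using the notation of Lemma 3), on the
set $\bigcap_{j=1}^p T_j$ it holds

$$\max_{i=1,\dots,p}\frac{1}{n}\sum_{k=1}^n(\hat W_{i,k} - W_{i,k})^4 = O\big((s\log p/\sqrt{n})^2\big).$$

Proof. We will first show that

$$\max_{l=1,\dots,p}\frac{1}{n}\sum_{k=1}^n(\hat W_{l,k} - W_{l,k})^2 = O(s\log p/n).$$

To see this, observe that

$$\max_{l=1,\dots,p}\frac{1}{n}\sum_{k=1}^n(\hat W_{l,k} - W_{l,k})^2 = \max_{l=1,\dots,p}\|X(\hat\Theta_l - \Theta_l)\|_2^2/n$$

$$\le \max_{l=1,\dots,p}(\hat\Theta_l - \Theta_l)^T\Sigma_0(\hat\Theta_l - \Theta_l) + \max_{l=1,\dots,p}|(\hat\Theta_l - \Theta_l)^T(\hat\Sigma - \Sigma_0)(\hat\Theta_l - \Theta_l)|$$

$$\le \Lambda_{\max}(\Sigma_0)\max_{l=1,\dots,p}\|\hat\Theta_l - \Theta_l\|_2^2 + \|\hat\Sigma - \Sigma_0\|_\infty\max_{l=1,\dots,p}\|\hat\Theta_l - \Theta_l\|_1^2.$$

By the above bound and by Lemma 6 we observe

$$\max_{l=1,\dots,p}\frac{1}{n}\sum_{k=1}^n(\hat W_{l,k} - W_{l,k})^2 = O_P(s\log p/n),$$

as required. Rewriting this as

$$\max_{l=1,\dots,p}\frac{1}{\sqrt{n}}\sum_{k=1}^n(\hat W_{l,k} - W_{l,k})^2 = O(s\log p/\sqrt{n}),$$

and taking the square of both sides, we obtain

$$\max_{l=1,\dots,p}\frac{1}{n}\sum_{k=1}^n(\hat W_{l,k} - W_{l,k})^4 \le \max_{l=1,\dots,p}\Big(\frac{1}{\sqrt{n}}\sum_{k=1}^n(\hat W_{l,k} - W_{l,k})^2\Big)^2 = O\big((s\log p/\sqrt{n})^2\big).$$

□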
9.2 Proofs for Section 3.2


Proof of Theorem 2. By the decomposition (16), we have

$$\hat T - \Theta_0 = -\Theta_0(\hat\Sigma - \Sigma_0)\Theta_0 + \underbrace{(\Theta_0\hat\Sigma - I)(\Theta_0 - \hat\Theta) + (\Theta_0 - \hat\Theta)^T(\hat\Sigma\hat\Theta - I)}_{\Delta}.$$

Under A1, A2, A3, by the proof of Lemma 1 there exist $c_1, c_2$ such that with
probability at least $1 - c_1 p^{1 - c_2\tau}$ it holds that

$$\|\Delta\|_\infty = O(C_\tau s\log p/n).$$

By Lemma 1^ in Appendix [Af with probability at least it holds 

|(0°f(^:-Eo)0°| = o(a/^/^). 


Hence there exists a constant C,- > 0 such that 


^ ^ogp\ 

\lij — 0 jj| > Ct max { s 


^/n' n J 


< C 3 max{p 


I-C 2 T „-C2T 


}■ 


But then since the constants C 2 , C 3 are universal, we can take the supremum over the 
model. By taking r sufficiently large it follows that 


0osS(s) 


sup P I Tij — 0 ^ • I > Ct max < , s 


1 log p \ 


y/n’’ n J 


< 046 


We proceed to show the second result of the Theorem. We have by Lemma 10 


m 


Appendix|^that there exist constants ci, C 2 such that with probability at least cip 
it holds 

||0o(S - So)0o||oo = 0(aVlogp/n). 

We have ||T— 0o||oo < Halloo + ||0o(S ~ Tlo)0o||oo- But then there exist constants 
Cr, Cl, C 2 > 0 such that 


\T- 0o||oo > Cr max 


logp logp 1 

-- } 

n n I 


< cip 


1—C2T 


□ 


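9.3 Proofs for Section 3.3

Proof of Lemma 4. First note that

$$\hat T - \Theta_0 = \hat\Theta - \Theta_0 - \hat\Theta^T(\hat\Sigma\hat\Theta - I) = \hat\Theta - \Theta_0 - \Theta_0(\hat\Sigma\hat\Theta - I) - (\hat\Theta - \Theta_0)^T(\hat\Sigma\hat\Theta - I)$$

$$= -\Theta_0(\hat\Sigma - \Sigma_0)\Theta_0 - \underbrace{(\Theta_0\hat\Sigma - I)(\hat\Theta - \Theta_0)}_{\mathrm{rem}_1} - \underbrace{(\hat\Theta - \Theta_0)^T(\hat\Sigma\hat\Theta - I)}_{\mathrm{rem}_2}.$$

Under A1 and A3 we have

$$\max_{j=1,\dots,p}\|\hat\Sigma\Theta_j^0 - e_j\|_\infty = O_P(\sqrt{\log p/n}).$$

For the first remainder we then have

$$\|\mathrm{rem}_1\|_\infty = \|(\Theta_0\hat\Sigma - I)(\hat\Theta - \Theta_0)\|_\infty \le \|\Theta_0\hat\Sigma - I\|_\infty\, |||\Theta_0 - \hat\Theta|||_1 = O_P(s\log p/n).$$

For the second remainder we have

$$\|\mathrm{rem}_2\|_\infty = \|(\hat\Theta - \Theta_0)^T(\hat\Sigma\hat\Theta - I)\|_\infty \le \|\hat\Sigma\hat\Theta - I\|_\infty\, |||\hat\Theta - \Theta_0|||_1 = O_P(s\log p/n).$$

Collecting the above statements yields for $i, j = 1, \dots, p$

$$\hat T_{ij} - \Theta^0_{ij} = -(\Theta_i^0)^T(\hat\Sigma - \Sigma_0)\Theta_j^0 + O_P(s\log p/n) = -(\Theta_i^0)^T(\hat\Sigma - \Sigma_0)\Theta_j^0 + o_P(1/\sqrt{n}),$$

and hence we have

$$\sqrt{n}(\hat T_{ij} - \Theta^0_{ij})/\sigma_{ij} \rightsquigarrow \mathcal{N}(0,1).$$

□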


A Concentration results for sub-Gaussian design

Lemma 8. Let $\alpha, \beta \in \mathbb{R}^p$ be such that $\|\alpha\|_2 \le M$,
$\|\beta\|_2 \le M$. Let $X_k \in \mathbb{R}^p$ satisfy the sub-Gaussianity assumption A3
with a constant $K > 0$. Then for $m \ge 2$,


E|a^ XkXf /? - Ea^ XkX^ < 


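Lemma 9. Let $\alpha, \beta \in \mathbb{R}^p$ be such that $\|\alpha\|_2 \le M$,
$\|\beta\|_2 \le M$. Let $X_k \in \mathbb{R}^p$ satisfy the sub-Gaussianity assumption A3
with a constant $K > 0$. For all $t > 0$

$$P\big(|\alpha^T(\hat\Sigma - \Sigma_0)\beta|/(2M^2K^2) > t + \sqrt{2t}\big) \le 2e^{-nt}.$$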


Lemma 10. Assume $\|\alpha_i\|_2 \le M$, $\|\beta_i\|_2 \le M$ for all $i = 1, \dots, p$
and A3 with $K$. For all $t > 0$ it holds


P ( max laHs - E„)/J|/(2M^A.^) > ( + VYt + F-LMM + !5iMl 


< e 


—nt 


For the proofs of Lemmas 8, 9 and 10 see Jankova and van de Geer (2015).
B Rates of convergence of the nodewise Lasso

Lemma 11. Let $\hat\Theta_j$ be the $j$-th column of the nodewise Lasso estimator. Then
it holds $\hat\Sigma\hat\Theta_j - e_j - \lambda_j Z_j = 0$.

Proof of Lemma 11. The KKT conditions give

$$-X_{-j}^T(X_j - X_{-j}\hat\gamma_j)/n + \lambda_j\hat\kappa_j = 0. \tag{20}$$

Multiplying (20) by $\hat\gamma_j^T$ and since $\hat\gamma_j^T\hat\kappa_j = \|\hat\gamma_j\|_1$,
we obtain

$$\hat\tau_j^2 = X_j^T(X_j - X_{-j}\hat\gamma_j)/n. \tag{21}$$

Rewriting $X_j - X_{-j}\hat\gamma_j = X\hat\Gamma_j = X\hat\Theta_j\hat\tau_j^2$, from
(21) we obtain

$$X_j^T X\hat\Theta_j/n = 1. \tag{22}$$

Similarly rewriting (20) gives

$$X_{-j}^T X\hat\Theta_j/n = \lambda_j\hat\kappa_j/\hat\tau_j^2. \tag{23}$$

Combining (22) and (23) we obtain $X^T X\hat\Theta_j/n - e_j = \lambda_j Z_j$, and
rearranging, we get $\hat\Sigma\hat\Theta_j - e_j - \lambda_j Z_j = 0$ as required. □